{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "colab": {
      "name": "09 - Building abstractive text summaries",
      "provenance": [],
      "collapsed_sections": []
    },
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3"
    }
  },
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "4Pjmz-RORV8E"
      },
      "source": [
        "# Building abstractive text summaries\n",
        "\n",
        "_This notebook is part of a tutorial series on [txtai](https://github.com/neuml/txtai), an AI-powered semantic search platform._\n",
        "\n",
        "In the field of text summarization, there are two primary categories of summarization, extractive and abstractive summarization.\n",
        "\n",
        "Extractive summarization takes subsections of the text and joins them together to form a summary. This is commonly backed by graph algorithms like TextRank to find the sections/sentences with the most commonality. These summaries can be highly effective but they are unable to transform text and don't have a contextual understanding.\n",
        "\n",
        "Abstractive summarization uses Natural Language Processing (NLP) models to build transformative summaries of text. This is similar to having a human read an article and asking what was it about. A human wouldn't just give a verbose reading of the text. This notebook shows how blocks of text can be summarized using an abstractive summarization pipeline. "
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "Dk31rbYjSTYm"
      },
      "source": [
        "# Install dependencies\n",
        "\n",
        "Install `txtai` and all dependencies. Since this notebook is using optional pipelines, we need to install the pipeline extras package."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "XMQuuun2R06J"
      },
      "source": [
        "%%capture\n",
        "!pip install git+https://github.com/neuml/txtai#egg=txtai[pipeline]"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "PNPJ95cdTKSS"
      },
      "source": [
        "# Create a Summary instance\n",
        "\n",
        "The Summary instance is the main entrypoint for text summarization. This is a light-weight wrapper around the summarization pipeline in Hugging Face Transformers.\n",
        "\n",
        "In addition to the default model, additional models can be found on the [Hugging Face model hub](https://huggingface.co/models?pipeline_tag=summarization).\n"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "nTDwXOUeTH2-"
      },
      "source": [
        "%%capture\n",
        "\n",
        "from txtai.pipeline import Summary\n",
        "\n",
        "# Create summary model\n",
        "summary = Summary()"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "-vGR_piwZZO6"
      },
      "source": [
        "# Summarize text\n",
        "\n",
        "The example below shows how a large block of text can be distilled down into a smaller summary."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 36
        },
        "id": "-K2YJJzsVtfq",
        "outputId": "cdf54f20-72ad-4f65-bc17-100e32e6cc71"
      },
      "source": [
        "text = (\"Search is the base of many applications. Once data starts to pile up, users want to be able to find it. It’s the foundation \"\n",
        "       \"of the internet and an ever-growing challenge that is never solved or done. The field of Natural Language Processing (NLP) is \"\n",
        "       \"rapidly evolving with a number of new developments. Large-scale general language models are an exciting new capability \"\n",
        "       \"allowing us to add amazing functionality quickly with limited compute and people. Innovation continues with new models \"\n",
        "       \"and advancements coming in at what seems a weekly basis. This article introduces txtai, an AI-powered search engine \"\n",
        "       \"that enables Natural Language Understanding (NLU) based search in any application.\"\n",
        ")\n",
        "\n",
        "summary(text, maxlength=10)"
      ],
      "execution_count": null,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "application/vnd.google.colaboratory.intrinsic+json": {
              "type": "string"
            },
            "text/plain": [
              "'Search is the foundation of the internet'"
            ]
          },
          "metadata": {},
          "execution_count": 3
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "n2jndgE-JyWX"
      },
      "source": [
        "Notice how the summarizer built a sentence using parts of the document above. It takes a basic understanding of language in order to understand the first two sentences and how to combine them into a single transformative sentence."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "27PneZxQx7NR"
      },
      "source": [
        "# Summarize a document\n",
        "\n",
        "The next section retrieves an article, extracts text from it (more to come on this topic) and summarizes that text."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 53
        },
        "id": "idPThgJGvIju",
        "outputId": "7d0580e6-2531-48c9-a32a-481ccf32900d"
      },
      "source": [
        "!wget -q \"https://medium.com/neuml/time-lapse-video-for-the-web-a7d8874ff397\"\n",
        "\n",
        "from txtai.pipeline import Textractor\n",
        "\n",
        "textractor = Textractor()\n",
        "text = textractor(\"time-lapse-video-for-the-web-a7d8874ff397\")\n",
        "\n",
        "summary(text)"
      ],
      "execution_count": null,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "application/vnd.google.colaboratory.intrinsic+json": {
              "type": "string"
            },
            "text/plain": [
              "'Time-lapse video is a popular way to show an area or event over a long period of time. The same concept can be applied to a dynamic real-time website with frequently updated data. webelapse is an open source project developed to provide this functionality. It can be used as is or modified for different use cases.'"
            ]
          },
          "metadata": {},
          "execution_count": 4
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "a63k89aDyKTW"
      },
      "source": [
        "Click through the link to see the full article. This summary does a pretty good job of covering what the article is about!"
      ]
    }
  ]
}